Identifying owners of provenance documents from their provenance network metrics.
In this notebook, we compare the classification accuracy on the unbalanced (original) ProvStore dataset with that on a balanced version of the same dataset.
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("provstore/data.csv")
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
# The number of each label in the dataset
df.label.value_counts()
Out[4]:
In [5]:
from analytics import test_classification
Cross Validation tests: We now run the cross validation tests on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for a detailed description of the cross validation code.
In [6]:
results, importances = test_classification(df)
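The test_classification helper above is defined in the accompanying analytics module, whose internals are documented in Cross Validation Code.ipynb. As a rough illustration only, a stratified cross-validation test of this kind might be sketched as follows (the choice of random forest classifier, the fold count, and the function name are assumptions, not the notebook's actual implementation):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validate_sketch(df, feature_cols, label_col="label"):
    """Hypothetical sketch: score a classifier on the given feature set
    using stratified k-fold cross validation."""
    X = df[feature_cols]
    y = df[label_col]
    # Random forest chosen here purely for illustration
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Stratified folds keep each label's proportion roughly constant
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores.mean(), scores.std()
```

Each of the three feature sets (combined, generic, provenance) would be passed as feature_cols in turn to obtain the per-set accuracies compared below.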
In [7]:
from analytics import balance_smote
Balancing the data
With an unbalanced dataset like the one above, the resulting trained classifier will typically be skewed towards the majority labels. To mitigate this, we balance the dataset using the SMOTE Oversampling Method.
In [8]:
df = balance_smote(df)
In [9]:
results_bal, importances_bal = test_classification(df)
Result: The classifiers achieve higher performance on the balanced data when provenance-specific metrics are used (with either the combined or the provenance metrics set). The classifiers trained on the generic metrics set, however, perform better on the original, unbalanced data. A possible explanation is that some of the minority labels have more distinctive provenance-specific metrics than generic ones; when more such samples are introduced by the balancing process, the generic metrics alone cannot identify those samples as well, hence the lower accuracy.